In [ ]:
# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources, connections, and used by platform APIs.
from project_lib import Project
project = Project(project_id='...', project_access_token='...')

Exploring the Groningen Meaning Bank Dataset

This notebook relates to the Groningen Meaning Bank - Modified dataset. The dataset contains tags for parts of speech and named entities in a set of sentences predominantly from news articles and other factual documents. This dataset can be obtained for free from the IBM Developer Data Asset Exchange.

In this notebook, we load, explore, clean and visualize the gmb_subset_full.txt dataset and generate a cleaned data file gmb_subset_full_cleaned.csv. The cleaned dataset is prepared for further analysis in the following notebooks.

Table of Contents:

  • 0. Prerequisites
  • 1. Read the Raw Data
  • 2. Data Preprocessing
  • 3. Data Visualization
  • 4. Save the Cleaned Data

0. Prerequisites

Before you run this notebook, complete the following steps:

  • Insert a project token
  • Import required packages

Insert a project token

When you import this project from the Watson Studio Gallery, a token should be automatically generated and inserted at the top of this notebook as a code cell such as the one below:

# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources, connections, and used by platform APIs.
from project_lib import Project
project = Project(project_id='YOUR_PROJECT_ID', project_access_token='YOUR_PROJECT_TOKEN')
pc = project.project_context

If you do not see the cell above, follow these steps to enable the notebook to access the dataset from the project’s resources:

  • Click on More -> Insert project token in the top-right menu section.
  • This should insert a cell at the top of this notebook similar to the example given above.

    If an error is displayed indicating that no project token is defined, follow these instructions.

  • Run the newly inserted cell before proceeding with the notebook execution below

Import required packages

Import and configure the required packages.

In [ ]:
# Install packages
!pip install wordcloud
!pip install cufflinks

# Clear output of messy cells
from IPython.display import clear_output
clear_output()
In [ ]:
# Define required imports
import io
import pandas as pd
import numpy as np
import nltk
import matplotlib
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import cufflinks as cf
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

1. Read the Raw Data

We start by reading in the raw dataset, displaying the first few rows of the dataframe, and taking a look at the columns and column types present.

In [ ]:
# Write a function to load data asset into notebook
def load_data_asset(data_asset_name):
    r = project.get_file(data_asset_name)
    if isinstance(r, list):
        bio = [ handle['file_content'] for handle in r if handle['data_file_title'] ==  data_asset_name][0]
        bio.seek(0)
        return io.TextIOWrapper(bio, encoding='utf-8')
    else:
        r.seek(0)
        return io.TextIOWrapper(r, encoding='utf-8')

# Read in gmb_subset_full.txt file
tf = load_data_asset('gmb_subset_full.txt')
rows = []
for line in tf.readlines():
    rows.append(line.rstrip('\n').split(' '))
print('Number of rows read: {}'.format(len(rows)))
In [ ]:
# Save the data as DataFrame
data = pd.DataFrame(rows, columns=['term', 'postags', 'entitytags'])

The IBM Developer DAX Groningen Meaning Bank page states that the file contains 1,314,115 rows in total, including the blank lines used to separate sentences. The printed number of rows read confirms that we read the data file correctly.
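
As mentioned at the start of this section, we can also glance at the first few rows and the inferred column types. A minimal optional check (not required by later cells):

# Preview the first rows and the column dtypes of the dataframe
print(data.head())
print(data.dtypes)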

2. Data Preprocessing

In this section, we prepare the data for training. Data preprocessing is often an important step in the data exploration process: issues such as unexpected values, wrongly parsed data, and missing values can occur in a dataset, and analyzing data that has not been carefully screened for such problems can lead to misleading results. We complete the data cleaning in this section. After that, we generate an additional sentence_id column, which indicates which sentence each term falls in. This step is fundamental for the machine learning analysis in the next notebook.

Reference of Data Pre-processing

Data Cleaning

First, we inspect whether there are missing values (empty lines) or parsing issues.

Since there is an empty line at the end of each sentence, we drop these rows.

In [ ]:
# Drop all rows which have 'NaN' values
data.dropna(inplace = True)

The dataframe has 1,256,664 nonempty rows of terms, pos tags and entity tags.

In [ ]:
data.shape

Check whether any rows were not parsed as expected.

In [ ]:
# A term that still contains a newline (e.g. a trailing 'O\n')
# would indicate that a line was not split correctly when read in
if len(data[data['term'].str.contains('O\n')]) == 0:
    print('No parsing issues were found.')
if len(data[data['term'].str.contains('\n')]) == 0:
    print('No parsing issues were found.')

Named-entity tags

The annotation scheme for named entities in the GMB distinguishes the following eight classes, plus a catch-all tag:

  • Person (PER) - Person entities are limited to individuals that are human or have human characteristics, such as divine entities.

  • Location (GEO) - Location entities are limited to geographical entities such as geographical areas and landmasses, bodies of water, and geological formations.

  • Organization (ORG) - Organization entities are limited to corporations, agencies, and other groups of people defined by an established organizational structure.

  • Geo-political Entity (GPE) - GPE entities are geographical regions defined by political and/or social groups. A GPE entity subsumes and does not distinguish between a city, a nation, its region, its government, or its people (LOC•ORG).

  • Artifact (ART) - Artifacts are limited to manmade objects, structures and abstract entities, including buildings, facilities, art and scientific theories.

  • Event (EVE) - Events are incidents and occasions that occur during a particular time.

  • Natural Object (NAT) - Natural objects are entities that occur naturally and are not manmade, such as diseases, biological entities and other living things.

  • Time (TIM) - Time entities are limited to references to certain temporal entities that have a name, such as the days of the week and the months of a year. All other temporal expressions are handled by the separate timex tagging layer.

  • Other (O) - Other entities include all other words which do not fall in any of the categories above.

We would like to inspect whether there are any unexpected tags other than the ones mentioned above. If so, there might be some parsing issues we need to resolve.

In [ ]:
# Check entitytags and how many values fall under each entitytag
data['entitytags'].value_counts()

The entity tags column provides important information about the type of each term. These entity tags cover 8 types of named entities: persons, locations, organizations, geo-political entities, artifacts, events, natural objects and times, as well as a tag for ‘no entity’. Each entity type may furthermore be tagged with either a “B-” tag or an “I-” tag: a “B-” tag marks the first term of a new entity (or the only term of a single-term entity), while subsequent terms in the same entity carry an “I-” tag. For example, “New York” would be tagged as ["B-GEO", "I-GEO"] while “London” would be tagged as "B-GEO".
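
To make the B-/I- convention concrete, here is a small sketch on a handful of hard-coded example tokens (not taken from the dataframe) that groups consecutive tags into complete entities:

# Hypothetical example: group B-/I- tagged tokens into complete entities
tokens = ['Next', 'week', 'New', 'York', 'hosts', 'the', 'United', 'Nations']
tags   = ['B-TIM', 'I-TIM', 'B-GEO', 'I-GEO', 'O', 'O', 'B-ORG', 'I-ORG']

entities = []
for token, tag in zip(tokens, tags):
    if tag.startswith('B-'):
        entities.append([tag[2:], [token]])   # start a new entity
    elif tag.startswith('I-') and entities:
        entities[-1][1].append(token)         # continue the current entity
for label, words in entities:
    print(label, ' '.join(words))             # TIM Next week / GEO New York / ORG United Nations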

In [ ]:
entity_labels = ['O', 'B-GEO', 'B-GPE', 'B-TIM', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG',
       'I-TIM', 'I-GEO', 'B-ART', 'I-ART', 'I-GPE', 'B-EVE', 'I-EVE',
       'B-NAT', 'I-NAT']
if sorted(data['entitytags'].unique()) == sorted(entity_labels):
    print('No mislabeled issues were found.')

Inspecting the data, we can see that no rows are mislabeled with categories outside the definitions above.

Drop 'O' categorized words

In each sentence, many more terms are categorized as O than as any other category. The O entity tag is not very informative because it only means "other", and keeping it makes it harder to inspect the remaining entity tags. We therefore create a new dataframe with the O rows dropped so that we can investigate the meaningful entity tags.

In [ ]:
# Select the index that has entitytags as Other ('O')
otherTag = data[data['entitytags'] == 'O'].index
# Create a new dataframe and save the 'O' dropped version
tag_df = pd.DataFrame(data.drop(otherTag))

Inspect entity tags and pos tags

After dropping the O entity tags, we want to inspect the entity and pos tags. Doing so tells us the unique values and frequencies of each tag, and also reveals any mis-categorized terms.

In [ ]:
# Inspect the count of each entity tag
# including the uniques, top frequency words
entitytag = tag_df.groupby("entitytags")['term']
entitytags = entitytag.describe()
entitytags

The .describe() method provides descriptive statistics that summarize the central tendency, dispersion and shape of a dataset's distribution. The method is helpful because it answers questions such as: which entity tag appears most frequently, what are the most common words within each entity tag, and how unique are the words? This kind of information is important at the data exploration stage in order to gain basic knowledge of the dataset. Reference of .describe() method
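
As a quick illustration of the four columns that .describe() reports for a text column (count, unique, top, freq), here is a toy example on made-up data rather than the GMB dataframe:

# Toy example: describe() on a grouped text column reports count, unique,
# top (most frequent value) and freq (how often the top value occurs)
toy = pd.DataFrame({'entitytags': ['B-GEO', 'B-GEO', 'B-GEO', 'B-PER'],
                    'term':       ['Iraq',  'Iraq',  'China', 'Obama']})
print(toy.groupby('entitytags')['term'].describe())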

Similarly, let's inspect the pos tags column to see whether any values are mis-categorized.

In [ ]:
postag = data.groupby("postags")['term']
postags = postag.describe()
postags

There are categories such as $, . and ,. These pos tags are not mislabeled: each punctuation mark is tagged as itself. Considering how much punctuation we use every day, it is natural to see these tags here. Every sentence contains punctuation, and we did not strip it, for the sake of content accuracy.
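
To verify this, we can spot-check the punctuation pos tags directly; a small optional check:

# Confirm that punctuation tokens carry their own symbol as the pos tag
punct = data[data['postags'].isin(['.', ',', '$'])]
print(punct[['term', 'postags']].drop_duplicates().head(10))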

Add additional column

The above inspections show no remaining issues, so we go ahead and add another column to the dataset.

In [ ]:
# Join all 'term' to a large string
data_text = ' '.join(data['term'].tolist())
# Split data_text into sentences
text_list = nltk.tokenize.sent_tokenize(data_text)
text_list[:5]
In [ ]:
# Add additional column that indicates 
# which sentence each word belongs to.
sentence_count = 0
sentence_count_list = []
# Loop through each term in data
for i in range(len(data['term'])):
    # Sentence ends with period
    # If it is period, then go to next sentence
    if data['term'].iloc[i] == '.':
        sentence_count_list.append(sentence_count)
        sentence_count = sentence_count+1
    # Else we are still in the current sentence
    else:
        sentence_count_list.append(sentence_count)

The original dataset includes three columns: terms, pos tags and entity tags. The cleaned dataset will include four columns: terms, pos tags, entity tags, plus a sentence index that indicates which sentence each term falls in. For example, the first sentence is 'Masked assailants with grenades and automatic weapons attacked a wedding party in southeastern Turkey , killing 45 people and wounding at least six others .', so the sentence index of the terms Masked, assailants, with... will all be 0.
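
For reference, the loop above could also be expressed with a vectorized cumulative sum; a sketch of the equivalent computation (the loop result is what we actually use below):

# Count how many '.' tokens occur strictly before each row, so that a
# period is still assigned to the sentence it terminates
is_period = (data['term'] == '.')
sentence_id_vectorized = is_period.cumsum().shift(1, fill_value=0).tolist()
assert sentence_id_vectorized == sentence_count_list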

In [ ]:
# Add sentence count column to dataframe
data['sentence_id'] = sentence_count_list
data.head(5)

3. Data Visualization

In this section, we visualize the data at four levels: the sentence level, the entity tag level, the pos tag level and the term level. Visualizing from these four perspectives gives us more tangible information about the data.

Investigate number of tokens in each sentence

At the sentence level, we would like to know how many tokens each sentence contains on average. This tells us whether the dataset is balanced; sentences should usually be about 20 to 30 words long.

In [ ]:
# Save sentence length to list
sentence_len = data['sentence_id'].value_counts().tolist()
# Plot sentence by length
plt.hist(sentence_len, bins=50)
plt.title('Number of words per sentence')
plt.xlabel('Sentence length in words')
plt.ylabel('Number of sentences')
plt.show()

Most sentences contain around 20 words. The tokens-per-sentence distribution is roughly symmetric, so the data is in good shape.

Visualize Entity tag Distribution

In [ ]:
# Visualize entity tags distribution
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)
tag_df.groupby('entitytags').count()['term'].sort_values(ascending=False).iplot(kind='bar', yTitle='Count', linecolor='black', opacity=0.8,
                                                           title='Entity Tags Distribution', xTitle='Entity tag name')

From the Entity Tags Distribution, we can see that B-GEO is the most frequently labeled entity tag. This makes sense, because location entities tend to appear frequently in news sentences.

Now, we would also like to visualize the count, unique and top-frequency data from the entity tags table. This tells us how many words are unique within each entity tag and how repetitive the document is.

In [ ]:
# Visualize entity tag frequency, uniques, counts
matplotlib.rcParams['figure.dpi'] = 150
plt.figure(figsize=(50,15))
entitytags.plot.barh()
plt.xticks(rotation=50)
plt.xlabel("Numbers")
plt.ylabel("Entity Tag")
plt.show()

Inspecting the counts, uniques and top-frequency words of each entity tag, we can see that I-PER, I-ORG, B-PER and B-ORG have relatively more unique words, meaning that the terms under these tags show more variety. The following notebooks will investigate the correlation and weights of each tag.

Visualize Pos tag Distribution

In [ ]:
# Visualize pos tags distribution
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)
tag_df.groupby('postags').count()['term'].sort_values(ascending=False).iplot(kind='bar', yTitle='Count', linecolor='black', opacity=0.8,
                                                           title='Pos Tag Distribution', xTitle='Pos tag name')

The top pos tags are: NNP (Proper noun, singular), JJ (Adjective), CD (Cardinal number), NN (Noun, singular or mass), IN (Preposition or subordinating conjunction), NNPS (Proper noun, plural), NNS (Noun, plural).

Reference of Part-of-speech Tags
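
The averaged_perceptron_tagger downloaded at the start of this notebook produces tags from the same Penn Treebank tag set; as a quick illustrative sketch, we can tag the first sentence of the corpus ourselves (this cell is optional and not used later):

# Tag one example sentence with NLTK to see the same pos tag names
sample = ('Masked assailants with grenades and automatic weapons attacked a wedding party '
          'in southeastern Turkey , killing 45 people and wounding at least six others .')
print(nltk.pos_tag(nltk.word_tokenize(sample)))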

Analyze word frequencies

A word cloud is a method for visually presenting text data. Word clouds are popular for text analysis because they make it easy to spot word frequencies: the more frequently a word is used, the larger and bolder it is displayed. Word clouds can add clarity during text analysis and help communicate results effectively. They can also reveal patterns in the text that may guide further analysis.

Reference of Word Cloud

In [ ]:
# Group B_GEO and I_GEO entitytags
B_GEO = data[data['entitytags'] == 'B-GEO']
I_GEO = data[data['entitytags'] == 'I-GEO']
GEO = [B_GEO, I_GEO]
GEO = pd.concat(GEO)
GEO_text = ' '.join(GEO['term'].tolist())

# Group B_TIM and I_TIM entitytags
B_TIM = data[data['entitytags'] == 'B-TIM']
I_TIM = data[data['entitytags'] == 'I-TIM']
TIM = [B_TIM, I_TIM]
TIM = pd.concat(TIM)
TIM_text = ' '.join(TIM['term'].tolist())

# Group B_ORG and I_ORG entitytags
B_ORG = data[data['entitytags'] == 'B-ORG']
I_ORG = data[data['entitytags'] == 'I-ORG']
ORG = [B_ORG, I_ORG]
ORG = pd.concat(ORG)
ORG_text = ' '.join(ORG['term'].tolist())

# Group B_PER and I_PER entitytags
B_PER = data[data['entitytags'] == 'B-PER']
I_PER = data[data['entitytags'] == 'I-PER']
PER = [B_PER, I_PER]
PER = pd.concat(PER)
PER_text = ' '.join(PER['term'].tolist())

# Create and generate a word cloud image:
wordcloud_GEO = WordCloud(collocations = False).generate(GEO_text)
wordcloud_TIM = WordCloud(collocations = False).generate(TIM_text)
wordcloud_ORG = WordCloud(collocations = False).generate(ORG_text)
wordcloud_PER = WordCloud(collocations = False).generate(PER_text)
In [ ]:
# Plot the four word clouds in a 2x2 grid; pass figsize to plt.subplots,
# which creates the figure that the images are actually drawn on
f, axarr = plt.subplots(2, 2, figsize=(50, 50))

axarr[0, 0].imshow(wordcloud_GEO, interpolation='bilinear', aspect='auto')
axarr[0, 1].imshow(wordcloud_TIM, interpolation='bilinear', aspect='auto')
axarr[1, 0].imshow(wordcloud_ORG, interpolation='bilinear', aspect='auto')
axarr[1, 1].imshow(wordcloud_PER, interpolation='bilinear', aspect='auto')

# Hide the axis ticks so only the word clouds are shown
for ax in axarr.ravel():
    ax.axis('off')
plt.show()

The four word clouds show the GEO, TIM, ORG and PER entities. The larger a word is drawn, the more frequently it appears in the corpus. In the GEO word cloud, Iraq, State, Iran, China, and United are the most prominent words. In the ORG word cloud, United, Taleban, Nation, and Qaida appear most frequently. From this, we can infer that the documents mainly discuss the political conflict between the Taleban and the United States during the presidency of George W. Bush.

Reference: The Washington Post - Bush announces strikes against Taliban

4. Save the Cleaned Data

Finally, we save the cleaned dataset as a project asset for later re-use. You should see an output like the one below if successful:

{'file_name': 'gmb_subset_full_cleaned.csv',
 'message': 'File saved to project storage.',
 'bucket_name': 'gmbgalleryprojectdev-donotdelete-pr-...',
 'asset_id': '...'}
In [ ]:
project.save_data("gmb_subset_full_cleaned.csv", data.to_csv(float_format='%g'), overwrite=True)

Next steps

  • Close this notebook.
  • Open the Part 2 - Named Entity Recognition notebook to explore the cleaned dataset.

Authors

This notebook was created by the Center for Open-Source Data & AI Technologies.

Copyright © 2020 IBM. This notebook and its source code are released under the terms of the MIT License.
